Return-Path: <icon-group-sender>
Received: from kingfisher.CS.Arizona.EDU (kingfisher.CS.Arizona.EDU [192.12.69.239])
by baskerville.CS.Arizona.EDU (8.9.1a/8.9.1) with SMTP id IAA06355
for <icon-group-addresses@baskerville.CS.Arizona.EDU>; Mon, 14 Sep 1998 08:24:35 -0700 (MST)
Received: by kingfisher.CS.Arizona.EDU (5.65v4.0/1.1.8.2/08Nov94-0446PM)
id AA01514; Mon, 14 Sep 1998 08:24:08 -0700
From: gep2@computek.net
Date: Sat, 12 Sep 1998 14:19:41 -0500 (CDT)
Message-Id: <199809121919.OAA18685@mail.cmpu.net>
Mime-Version: 1.0
Content-Type: text/plain
Content-Transfer-Encoding: 7bit
Subject: Re: Unicode support or support for non-Ascii based character manipulation?
To: icon-group@optima.CS.Arizona.EDU
X-Mailer: SPRY Mail Version: 04.00.06.17
Errors-To: icon-group-errors@optima.CS.Arizona.EDU
Status: RO
>> Okay, I don't dispute that this move is happening but personally I still
>> don't very much like it. The fact is that (at least here in the Western
>> Hemisphere, where probably most of the world's computers are used) an eight-bit
>> byte is already quite sufficient for most purposes, and doubling it comes at a
>> cost in complexity and storage (RAM, disk, tape, whatever) which is simply very,
>> very hard to justify on any genuine economic basis.
> This is a fictitious problem.
Which? Most of the points there are not subject to dispute, at least for most
of us here in the USA.
a) That I don't very much like it?
b) That most of the world's computers are used in the Western Hemisphere?
c) That an eight-bit byte is quite sufficient HERE for most (I didn't say
ALL) purposes?
d) That doubling it to a sixteen-bit byte comes at a cost (I didn't say a
HUGE cost, but it IS a cost) in complexity and storage?
e) That such a cost is hard to justify (again, for MOST purposes, in
particular for business and most typical home use) given the limited or only
specialized need for a bunch of exotic characters that probably 95% of the
Western world's PC users are likely to never use?
> UNIX systems at least...
...which represent something like 4% of machines sold, and it looks like NT 5.0
will continue to erode corporate use of Unix...
> ...support UTF-8, which is a compression method
> described in ISO 10646 and the Unicode book that has the property
> that ASCII characters *still* occupy exactly one byte each.
Okay, but this still results in more complex file formats and the need for
suitable compression and decompression routines, and/or the use of mixed-mode
processing in handling strings and/or doubling storage requirements for such
strings while they are in memory (and thus obsoleting a lot of existing tools,
library routines, and other programming). We've already talked about some of
the issues regarding Icon implementation, and those are probably not
insurmountable (indeed, I think that a fully Unicode-supporting Icon
implementation... NOT to replace the normal one!!... might be a very popular
tool among those people who for whatever reason decide to use Unicode).
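(As an aside, the byte-length property being discussed is easy to check. A minimal Python sketch, purely illustrative and not part of the original exchange:)

```python
# UTF-8 keeps ASCII at exactly one byte; other characters take more.
for ch in ("A", "\u00e9", "\u4e2d"):  # ASCII, accented Latin, a CJK character
    print(repr(ch), len(ch.encode("utf-8")), "byte(s)")
```

So plain-ASCII files and strings are unchanged under UTF-8; only the non-ASCII characters pay the multi-byte price.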
> When I use getwc() on this system, it decodes UTF-8 files and gives me
> ISO 10646 wide characters internally.
Which means, I presume, that those characters internally take twice the storage
they would otherwise. That comes at a cost in storage, and with the disadvantage
that (barring some kind of new machine architecture, at least, where there is a
NATIVE 16-bit byte, I suppose, without direct addressability of increments
smaller than that) programming must change to account for the fact that all
bytes are now byte PAIRS and that alignment issues suddenly become of prime
importance.
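(The doubling claim can be illustrated concretely. A minimal Python sketch; UTF-16 little-endian is used here purely as a stand-in for a fixed 16-bit wide-character representation:)

```python
text = "plain ASCII text"
utf8 = text.encode("utf-8")       # one byte per ASCII character
wide = text.encode("utf-16-le")   # two bytes per character for this text
print(len(utf8), len(wide))       # the wide form is exactly double
```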
>> If other countries have more difficult (or huge) character sets,
>> that is (while a fact of life) simply an inherent disadvantage
>> of their culture (and note that I'm not intending that as a slam
>> or value judgement, it just IS the way it is), and I don't see a
>> terribly convincing argument why the other countries (without
>> that disadvantage) ought to pay the price too, just in order to
>> artificially level the playing field.
> Many people _within_ Western Europe do not have the luxury of dealing
> with only a single language.
Sure, but I'll point out that the great majority of them (and here I'm talking
about typical business and home users, I'm not talking about academic types who
ABSOLUTELY have to have a whole assortment of Armenian, Sanskrit and other
highly specialized fonts for their scholarly work) do rather okay with the
systems they're presently using.
> I cannot write my father's name in ASCII, nor my sister-in-law's. Both of
> them are (in my father's case, were) monoglot Anglophones born into monoglot
> Anglophone families in an English-speaking country. I _can_ write their names
> in ISO Latin-1, but I _can't_ write half of the place-names of this country!
I note that you don't mention WHICH country you're talking about.
Of course, I suppose I could buy an island somewhere and name it some new name
using some bizarre alphabet, and then ask everyone in the world to adjust all
their systems to support my new alphabet!
When most immigrants came to the USA during the latter half of the previous
century (and the first twenty or so years of this one) a LOT of them changed the
spelling and writing of their names. Hey, I can't address a letter to
Peking/Beijing/whatever from my computer these days using the *REAL* name of the
city, spelling the name the way the local residents do, either. Even among
Western countries, a Parisian sending a letter to London will usually address it
as "Londres", and most Americans writing to a friend in Cologne, Germany will
address it that way rather than "Koln" (yeah, I know that they put the
double-dot over the "o" too). But you know something? All of those letters
WILL be delivered just fine to the recipients in Beijing, London, or Cologne,
because we NORMALLY deal (and generally reasonably well) with these differences
of the way that different world peoples call each other's countries. Not just
when the names are different, but also when the alphabets are different. I'm
sure I could write a letter to someone in an Arab country using a Western,
non-Arab alphabet and still get it delivered. Despite the fact that locally
written letters are doubtless addressed in Arabic. The post office there can
handle BOTH (and better, I'm sure, than the US post office could deal with a
letter addressed to someone HERE in Arabic!).
> (The officially approved orthography for Maori puts a macron over
> long vowels, like the 'a' in Maori. There are no macrons in Latin-1.)
> Even if my text switched between Latin-1 family members, I _still_
> wouldn't be able to write English, because the inverted comma and
> double inverted comma quotation marks are not available, let
> alone en dashes and em dashes.
Frankly, I think the double quote and apostrophe work just fine for most people.
So to say that you "can't write English" is fairly ridiculous. In fact, what
will probably happen is that these archaic inconveniences will simply fade away,
due precisely to the fact that they aren't widely supported and most people
simply couldn't care less.
> The *only* character set around in which this functionally-monoglot
> Anglophone can write *in English* about the people and places around
> him is ISO 10646; even Latin-1 just isn't good enough FOR ENGLISH!
Frankly, I think that the great majority of your audience will probably do just
fine with a "close approximation". My neighbor and wonderful friend in Paris
was Russian (in fact, he's on this list... HI Vlad!) but he didn't seem to be
terribly upset that he couldn't write his name there spelled using the Cyrillic
characters he'd grown up with. What's important for most people is that they
communicate successfully with the people that are important to them, and most of
the time we do that pretty well.
Frankly, if you told most Americans that they weren't writing proper English
because they didn't use inverted commas and double inverted comma quotation
marks, or properly use en dashes and em dashes, I suspect that they'd look at
you with disbelief as if you were from Mars or something, and tell you to get a
life.
> I also note that Icon (like SNOBOL before it) has been of particular
> interest to scholars in the humanities, who would, for example, like
> to put Hebrew _and_ Arabic in the same document with English, which
> is something you can't do in any ISO 8859 family member, not without
> code switching, which is much harder to deal with than Unicode.
Obviously scholars who worry about such issues have a variety of specialized
word processors and other such software to deal with their multi-lingual,
multi-alphabet requirements (and that's as it should be, probably). Again, as
I've mentioned in other posts, there are a whole series of issues that go way
beyond simply having enough characters in the character set for "everyone's"
characters to be there in direct, native mode. Some languages write
right-to-left in horizontal rows (Hebrew for example), and some languages write
top to bottom and then to the left in vertical rows (Japanese for instance).
Trying to mix these styles in the same document and on the same line is complex
at minimum and very frustrating for typical users (when using such word
processors, the simple use of the left and right arrow keys to move the cursor
certainly doesn't obey the "principle of least astonishment" as it's known to
most of us!).
> There is the pretty obvious point that within Europe, they are going
> to *have* to use the new "Euro" sign. (Why have the Europeans
> named their new currency after an Australian mammal?) That's U+20AC,
> and if there's an 8-bit character set that has it, please tell us which.
You're being ridiculous, since OBVIOUSLY they have created a NEW character
EXPRESSLY for the purpose of it being new. Clearly it's not part of *any*
previously-existing character set. (For that matter, it wasn't part of Unicode
EITHER before they created it and got it added).
Even once the character is added officially to the CHARACTER SET, even that
doesn't really begin to solve the problem. Because now you have to address the
issue of how you're going to ENTER it (keyboard?), and how you're going to
DISPLAY it. There are (at least!) tens of thousands of fonts out there, and
*none* of them will have these newly-created characters in them. I'd hate to
even think of a TrueType font for "all" of Unicode's characters. Let alone a
full set of fonts for all the different type styles and variants. These fonts
(for those of us that tend to collect a lot of them) take up too much space on
hard disks as it is.
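(For what it's worth, the encoding side of the Euro question is the easy part. A minimal Python sketch, illustrative only, showing that U+20AC encodes cleanly in UTF-8 but has no slot in Latin-1 -- entry and display remain separate problems, as noted above:)

```python
euro = "\u20ac"
print(euro.encode("utf-8"))   # three bytes in UTF-8: b'\xe2\x82\xac'
try:
    euro.encode("latin-1")    # Latin-1 has no Euro sign
except UnicodeEncodeError:
    print("not representable in Latin-1")
```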
>> I can certainly understand and appreciate the problems that the huge
>> character sets used in some eastern countries have played for them
> Never mind eastern countries. What about an American businessman writing
> to an office in Germany about their operations in Russia?
Straw man. These communications take place just fine today, without using
Cyrillic.
> What about a
> theologian writing in English but quoting Hebrew and Greek frequently?
That's of academic interest but (HIGHLY specialized) academic needs should NOT
force businesses and typical home users to pay more to support the needs of a
VERY small percentage (at least until you get REAL far away) of other users.
> What about an English professor writing a book in modern English about
> Old English (we've lost four letters, which can be found in Unicode
> but not any 8-bit character set I know of. Ash _is_ in Latin1, but
> eth, thorn, yogh, and wynn are not.)
Again, most of us couldn't care less. He (or she) is welcome to deal with that
issue however they like. The current system has NOT precluded such scholarly
research up to now, so I don't see why this is such a big issue all of a sudden.
> By the way, 16 bits isn't enough; there are proposals already far advanced
> in the pipeline for characters to go into Plane 1.
And that starts to get even more ridiculous. As I said, it's a slippery slope
when you decide that everyone has to be able to support EVERYBODY else's needs,
even when for most people they are TOTALLY IRRELEVANT. I would imagine that
someone has even assigned "official" Unicode character assignments to Klingon
characters! So are OTHER people going to start dreaming up their own weird
alphabets and asking the rest of the world to jump through hoops supporting
those, too?
Frankly, I'm never going to need to read (OR WRITE!) Armenian. I'm even
unlikely to read or write most Asian languages, or Hebrew, or numerous others
which are important to many people SOMEWHERE on the globe. And frankly, I think
most of my consulting clients' needs are served just fine by "normal" ASCII. It
is ludicrous to expect them to put up with extra cost and complexity in their
business to support something that they don't need, don't want, and in fact
would have *no* use for whatsoever.
People who DO have special requirements (and I'm not disputing that there ARE
such persons) should, for their part, EXPECT to deal with the extra costs and
the additional hassles that their special needs demand.
Gordon Peterson
http://www.computek.net/public/gep2/
Support the Anti-SPAM Amendment! Join at http://www.cauce.org/